AllLife Bank wants to focus on its credit card customer base in the next financial year. The marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing ones. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customer queries are resolved faster. The Head of Marketing and the Head of Delivery have both decided to reach out to the Data Science team for help.
To identify different segments in the existing customer base, based on their spending patterns as well as past interactions with the bank, using clustering algorithms, and to provide recommendations to the bank on how to better market to and service these customers.
The data provided covers various customers of a bank and their financial attributes, such as credit limit, the total number of credit cards the customer has, and the different channels through which customers have contacted the bank with queries (visiting the bank, online, and through a call center).
- Sl_No: Primary key of the records
- Customer Key: Customer identification number
- Average Credit Limit: Average credit limit of each customer across all credit cards
- Total credit cards: Total number of credit cards possessed by the customer
- Total visits bank: Total number of visits (yearly) the customer made in person to the bank
- Total visits online: Total number of online visits or logins (yearly) made by the customer
- Total calls made: Total number of calls (yearly) made by the customer to the bank or its customer service department
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
!pip install pandas-profiling==1.1.0
import pandas_profiling
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
import seaborn as sns
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.datasets import make_blobs
# to visualize the elbow curve and silhouette scores
!pip install yellowbrick
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to compute distances
from scipy.spatial.distance import pdist
# to perform PCA
from sklearn.decomposition import PCA
# Pandas dataframe options
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth',400)
# set the background for the graphs
plt.style.use('ggplot')
# For pandas profiling
from pandas_profiling import ProfileReport
# Printing style
from tabulate import tabulate
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings('ignore')
!pip install openpyxl
df = pd.read_excel('Credit+Card+Customer+Data.xlsx')
data = df.copy()
print('There are {row} records, and {col} columns in the dataset'.format(row=data.shape[0], col=data.shape[1]))
There are 660 records, and 7 columns in the dataset
data.isnull().sum()
Sl_No                  0
Customer Key           0
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64
There are no missing values in the dataset
data.duplicated().sum()
0
There are no duplicate records
data.sample(10)
| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 63 | 64 | 91673 | 14000 | 3 | 0 | 4 | 6 |
| 76 | 77 | 90168 | 17000 | 1 | 2 | 4 | 6 |
| 241 | 242 | 81878 | 10000 | 4 | 5 | 1 | 3 |
| 412 | 413 | 92140 | 31000 | 6 | 5 | 2 | 1 |
| 609 | 610 | 11562 | 38000 | 4 | 3 | 2 | 0 |
| 322 | 323 | 23881 | 8000 | 6 | 3 | 1 | 2 |
| 554 | 555 | 97850 | 33000 | 5 | 2 | 2 | 0 |
| 617 | 618 | 98216 | 136000 | 8 | 0 | 13 | 0 |
| 560 | 561 | 35268 | 46000 | 6 | 3 | 1 | 0 |
| 546 | 547 | 44157 | 68000 | 7 | 2 | 1 | 4 |
data.dtypes
Sl_No                  int64
Customer Key           int64
Avg_Credit_Limit       int64
Total_Credit_Cards     int64
Total_visits_bank      int64
Total_visits_online    int64
Total_calls_made       int64
dtype: object
All the attributes are integers
There is only one candidate key attribute, the Customer Key. We'll first convert Sl_No to the index, and then check the Customer Key. Since this data is to be used for customer segmentation, validating the customer key is essential.
I have already checked this in Excel, hence I am depicting the same in the notebook. Considering the small size of the dataset, it is quite convenient to check a few initial things in the Excel file itself first.
data = data.set_index(['Sl_No'])
data.index.max()
660
data['Customer Key'].nunique()
655
There appear to be 5 duplicate values in the customer key
data[data['Customer Key'].isin(data[data['Customer Key'].duplicated()]['Customer Key'].tolist())].sort_values('Customer Key')
| Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|
| 49 | 37252 | 6000 | 4 | 0 | 2 | 8 |
| 433 | 37252 | 59000 | 6 | 2 | 1 | 2 |
| 5 | 47437 | 100000 | 6 | 0 | 12 | 3 |
| 333 | 47437 | 17000 | 7 | 3 | 1 | 0 |
| 412 | 50706 | 44000 | 4 | 5 | 0 | 2 |
| 542 | 50706 | 60000 | 7 | 5 | 2 | 2 |
| 392 | 96929 | 13000 | 4 | 5 | 0 | 0 |
| 399 | 96929 | 67000 | 6 | 2 | 2 | 2 |
| 105 | 97935 | 17000 | 2 | 1 | 2 | 10 |
| 633 | 97935 | 187000 | 7 | 1 | 7 | 0 |
The records for the same Customer Key look very different from each other. I am assuming this is either a mistake in the Customer Key assignment, or we are missing something like a `current_version_indicator` column in the dataset. For now, I am going to treat these as different customers. After the clustering, I will analyze the groups corresponding to these sets of records.
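That follow-up is easier if the shared-key records are tagged up front. A minimal sketch (the helper name `flag_shared_keys` is my own, not part of the original analysis):

```python
import pandas as pd

def flag_shared_keys(df: pd.DataFrame, key: str = 'Customer Key') -> pd.Series:
    """Mark every record whose key value appears more than once in the column."""
    return df[key].duplicated(keep=False)

# usage sketch, mirroring the duplicate check above:
# data['shared_key'] = flag_shared_keys(data)
# data[data['shared_key']].sort_values('Customer Key')
```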
def feature_name_standardize(df: pd.DataFrame):
    df_ = df.copy()
    df_.columns = [col.replace(" ", "_").lower() for col in df_.columns]
    return df_
data = feature_name_standardize(data)
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| customer_key | 660.000 | 55141.444 | 25627.772 | 11265.000 | 33825.250 | 53874.500 | 77202.500 | 99843.000 |
| avg_credit_limit | 660.000 | 34574.242 | 37625.488 | 3000.000 | 10000.000 | 18000.000 | 48000.000 | 200000.000 |
| total_credit_cards | 660.000 | 4.706 | 2.168 | 1.000 | 3.000 | 5.000 | 6.000 | 10.000 |
| total_visits_bank | 660.000 | 2.403 | 1.632 | 0.000 | 1.000 | 2.000 | 4.000 | 5.000 |
| total_visits_online | 660.000 | 2.606 | 2.936 | 0.000 | 1.000 | 2.000 | 4.000 | 15.000 |
| total_calls_made | 660.000 | 3.583 | 2.865 | 0.000 | 1.000 | 3.000 | 5.000 | 10.000 |
The first step of univariate analysis is to check the distribution/spread of the data, primarily using histograms and box plots. Additionally, we'll plot each numerical feature on a violin plot and a cumulative density distribution plot. For these four kinds of plots, we build the summary() function below to plot each of the numerical attributes. It also displays the feature-wise 5-point summary.
def summary(data: pd.DataFrame, x: str):
    '''
    Prints the 5-point summary and plots the histogram, box plot,
    violin plot, and cumulative density distribution for the
    feature name passed as the argument.
    Parameters:
    ----------
    data: pd.DataFrame, the dataset
    x: str, feature name
    Usage:
    ------------
    summary(data, 'age')
    '''
    x_min = data[x].min()
    x_max = data[x].max()
    Q1 = data[x].quantile(0.25)
    Q2 = data[x].quantile(0.50)
    Q3 = data[x].quantile(0.75)
    five_point = {'Min': x_min, 'Q1': Q1, 'Q2': Q2, 'Q3': Q3, 'Max': x_max}
    df = pd.DataFrame(data=five_point, index=['Value'])
    print(f'5 Point Summary of {x.capitalize()} Attribute:\n')
    print(tabulate(df, headers='keys', tablefmt='psql'))
    fig = plt.figure(figsize=(16, 8))
    plt.subplots_adjust(hspace=0.6)
    sns.set_palette('Pastel1')
    plt.subplot(221, frameon=True)
    ax1 = sns.distplot(data[x], color='purple')
    ax1.axvline(
        np.mean(data[x]), color="purple", linestyle="--"
    )  # Add mean to the histogram
    ax1.axvline(
        np.median(data[x]), color="black", linestyle="-"
    )  # Add median to the histogram
    plt.title(f'{x.capitalize()} Density Distribution')
    plt.subplot(222, frameon=True)
    ax2 = sns.violinplot(x=data[x], palette='Accent', split=True)
    plt.title(f'{x.capitalize()} Violinplot')
    plt.subplot(223, frameon=True, sharex=ax1)
    ax3 = sns.boxplot(x=data[x], palette='cool', width=0.7, linewidth=0.6, showmeans=True)
    plt.title(f'{x.capitalize()} Boxplot')
    plt.subplot(224, frameon=True, sharex=ax2)
    ax4 = sns.kdeplot(data[x], cumulative=True)
    plt.title(f'{x.capitalize()} Cumulative Density Distribution')
    plt.show()
print('We will check the summary of below columns: \n', data.columns.tolist())
We will check the summary of below columns: ['customer_key', 'avg_credit_limit', 'total_credit_cards', 'total_visits_bank', 'total_visits_online', 'total_calls_made']
summary(data, 'avg_credit_limit')
5 Point Summary of Avg_credit_limit Attribute: +-------+-------+-------+-------+-------+--------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+-------+-------+-------+--------| | Value | 3000 | 10000 | 18000 | 48000 | 200000 | +-------+-------+-------+-------+-------+--------+
The attribute is right skewed with a lot of outliers
summary(data, 'total_credit_cards')
5 Point Summary of Total_credit_cards Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 1 | 3 | 5 | 6 | 10 | +-------+-------+------+------+------+-------+
The attribute is fairly normally distributed with a few spikes
summary(data, 'total_visits_bank')
5 Point Summary of Total_visits_bank Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 0 | 1 | 2 | 4 | 5 | +-------+-------+------+------+------+-------+
The data is slightly right skewed
summary(data, 'total_visits_online')
5 Point Summary of Total_visits_online Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 0 | 1 | 2 | 4 | 15 | +-------+-------+------+------+------+-------+
The data is right skewed and has some outliers to the right
summary(data, 'total_calls_made')
5 Point Summary of Total_calls_made Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 0 | 1 | 3 | 5 | 10 | +-------+-------+------+------+------+-------+
The data is right skewed
Creating a function to plot labeled bar plots of the features, with percentage labels on the bars.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 3))
    else:
        plt.figure(figsize=(n + 1, 3))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # midpoint of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
Creating a credit card limit bin out of the available data in the avg credit limit feature
data['cc_spending_bin'] = pd.cut(
    data['avg_credit_limit'],
    bins=[3000, 10000, 18000, 48000, 200000],
    labels=['Very Low', 'Low', 'Mid', 'High'],
    include_lowest=True,
)
labeled_barplot(data, 'cc_spending_bin', perc=True)
labeled_barplot(data, 'total_credit_cards', perc=True)
labeled_barplot(data, 'total_visits_bank', perc=True)
labeled_barplot(data, 'total_visits_online', perc=True)
labeled_barplot(data, 'total_calls_made', perc=True)
sns.set(palette="Set2")
# pairplot creates its own figure, so a separate plt.figure call is unnecessary
sns.pairplot(data.iloc[:, 1:], diag_kind='kde', corner=True);
# Plotting correlation heatmap of the features
mask = np.zeros_like(data.iloc[:, 1:].corr(), dtype=bool)  # np.bool is deprecated
mask[np.triu_indices_from(mask)] = True
sns.set(rc={"figure.figsize": (8, 8)})
sns.heatmap(
    data.iloc[:, 1:].corr(),
    cmap=sns.diverging_palette(20, 220, n=200),
    annot=True,
    mask=mask,
    center=0,
)
plt.show()
- Average Credit Limit has a slightly positive correlation with Total Credit Cards (as expected) and Total Visits Online, and a slightly negative correlation with Total Calls Made
- Total Credit Cards and Total Calls Made are negatively correlated
- Total Visits Online is also negatively correlated with Total Visits Bank
# Function to plot numerical feature by each category with target hue
def plot_numeric_by_cat(data: pd.DataFrame, category_columns: list, numeric_column: str, hue: str = None):
    '''
    Plots a numerical feature as a box plot against every category column in the list,
    with an optional hue of a target category
    '''
    num_cols = 2
    num_rows = int(len(category_columns) / 2 + 1)
    plt.figure(figsize=(20, 8 * num_rows))
    for i, col in enumerate(category_columns):
        plt.subplot(num_rows, num_cols, i + 1)
        sns.set(palette="nipy_spectral")
        sns.boxplot(data=data, x=col, y=numeric_column, hue=hue, showfliers=True).set(title=numeric_column + ' vs. ' + col)
plot_numeric_by_cat(
    data,
    category_columns=['total_credit_cards', 'total_visits_bank', 'total_visits_online', 'total_calls_made'],
    numeric_column='avg_credit_limit',
)
We can see clear segmentation with respect to each pair of features
Before clustering, we should always scale the data, because differing feature scales would give unintended importance to the features with larger scales when calculating distances.
# scaling the dataset before clustering
data_copy = data.copy()
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data.iloc[:,1:-1])
A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.
clusters = range(1, 10)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(data_scaled)
    prediction = model.predict(data_scaled)
    distortion = (
        sum(np.min(cdist(data_scaled, model.cluster_centers_, "euclidean"), axis=1))
        / data_scaled.shape[0]
    )
    meanDistortions.append(distortion)
plt.plot(clusters, meanDistortions, "bo-")
plt.xlabel("k")
plt.ylabel("Average distortion")
plt.title("Selecting k with the Elbow Method")
for x, y in zip(clusters, meanDistortions):
    label = "{:.2f}".format(y)
    plt.annotate(
        label,                       # the text to display
        (x + 0.2, y),                # the point to label
        textcoords="offset points",  # how to position the text
        xytext=(0, 10),              # distance from the point
        ha="center",                 # horizontal alignment
    )
plt.show()
Appropriate value for k seems to be 3
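As a rough programmatic cross-check of reading the elbow off the plot, one heuristic (my own assumption, not the notebook's method) is to pick the k where the inertia curve bends most sharply, i.e. where the second difference of inertia is largest:

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_k(X, k_range=range(1, 10), random_state=0):
    """Heuristic elbow pick: the k with the largest second difference of inertia."""
    ks = list(k_range)
    inertias = [
        KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_
        for k in ks
    ]
    curvature = np.diff(inertias, 2)          # one value per interior k
    return ks[int(np.argmax(curvature)) + 1]  # +1 maps back to the middle k
```

A heuristic like this only complements the visual read; the elbow plot (and the silhouette scores below) remain the primary evidence.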
Let's check the silhouette scores.
A score closer to 1 indicates that the data point is very similar to other data points in the cluster,
A score closer to -1 indicates that the data point is not similar to the data points in its cluster.
sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(data_scaled)
    score = silhouette_score(data_scaled, preds)
    sil_score.append(score)
plt.plot(cluster_list, sil_score, "bo-")
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.title("Selecting k using Silhouette Score")
for x, y in zip(cluster_list, sil_score):
    label = "{:.2f}".format(y)
    plt.annotate(
        label,                       # the text to display
        (x + 0.2, y),                # the point to label
        textcoords="offset points",  # how to position the text
        xytext=(0, 10),              # distance from the point
        ha="center",                 # horizontal alignment
    )
plt.show()
The silhouette score for 3 clusters is the highest, so we will choose 3 as the value of k.
Let's also visualize the silhouettes created by each of the clusters for two values of K, 3 and 4
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(data_scaled)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 660 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(data_scaled)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 660 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
Clearly, 3 clusters seem very reasonable for this case study
# let's take 3 as number of clusters
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(data_scaled)
KMeans(n_clusters=3, random_state=0)
# adding kmeans cluster labels to the original and scaled dataframes
data_scaled_df = pd.DataFrame(data_scaled, columns=data.iloc[:,1:-1].columns.unique().tolist())
data['cluster'] = kmeans.labels_
data_scaled_df['cluster'] = kmeans.labels_
# pairplot creates its own figure, so a separate plt.figure call is unnecessary
sns.pairplot(data_scaled_df, diag_kind='kde', corner=True, hue='cluster', palette='Set2');
- It appears the method of contacting the bank (in person / online / call) predominantly drives the clustering; we'll explore this in the plot below.
import plotly as py
import plotly.graph_objs as go
trace1 = go.Scatter3d(
    x=data['total_visits_bank'],
    y=data['total_visits_online'],
    z=data['total_calls_made'],
    mode='markers',
    marker=dict(
        color=data['cluster'],
        size=20,
        line=dict(
            color=data['cluster'],
            width=12
        ),
        opacity=0.8
    ),
)
d = [trace1]
layout = go.Layout(
    title='Clusters',
    scene=dict(
        xaxis=dict(title='Bank Visits'),
        yaxis=dict(title='Online Visits'),
        zaxis=dict(title='Call Customer Service')
    )
)
fig = go.Figure(data=d, layout=layout)
py.offline.iplot(fig)
sns.set(style='dark')
sns.scatterplot(x=data['avg_credit_limit'], y=data['total_credit_cards'], hue=data['cluster'], palette=['blue', 'red', 'orange'])
<AxesSubplot:xlabel='avg_credit_limit', ylabel='total_credit_cards'>
cluster_profile = data.iloc[:,1:].groupby('cluster').mean()
cluster_profile['count_of_customers'] = data.groupby('cluster')['customer_key'].count()
cluster_profile.style.highlight_max(color='lightgreen').highlight_min(color='pink')
| cluster | avg_credit_limit | total_credit_cards | total_visits_bank | total_visits_online | total_calls_made | count_of_customers |
|---|---|---|---|---|---|---|
| 0 | 33782.383420 | 5.515544 | 3.489637 | 0.981865 | 2.000000 | 386 |
| 1 | 12174.107143 | 2.410714 | 0.933036 | 3.553571 | 6.870536 | 224 |
| 2 | 141040.000000 | 8.740000 | 0.600000 | 10.900000 | 1.080000 | 50 |
data_scaled_df.boxplot(by='cluster', layout=(3,2), figsize=(10,14));
- Looking at the data, there is a group that prefers online interactions with the bank; these customers have a much higher credit limit and also hold more credit cards (cluster 2).
- Customers who prefer in-person interactions tend to have a mid-range credit limit and number of credit cards (cluster 0).
- Customers who contact the bank via phone calls form another segment, with the lowest credit limits and fewest cards (cluster 1).
Before starting hierarchical clustering, we'll remove the cluster column from the dataset.
k_means_clusters = data_scaled_df['cluster']
data_scaled_df.drop(columns=['cluster'], inplace=True)
I am going to try many distance metrics and linkage methods to find the best combination.
The cophenetic correlation should be very close to 1 for a high-quality solution. This measure can be used to compare alternative cluster solutions obtained using different algorithms.
# list of distance metrics
distance_metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation',
'cosine', 'euclidean', 'hamming', 'jaccard', 'mahalanobis', 'matching', 'minkowski', 'seuclidean', 'sqeuclidean']
# list of linkage methods
linkage_methods = ['complete', 'average', 'single', 'weighted']
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_scaled_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(data_scaled_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Braycurtis distance and complete linkage is 0.5082233975395475. Cophenetic correlation for Braycurtis distance and average linkage is 0.6052152783444972. Cophenetic correlation for Braycurtis distance and single linkage is 0.8502845518924395. Cophenetic correlation for Braycurtis distance and weighted linkage is 0.531818336516011. Cophenetic correlation for Canberra distance and complete linkage is 0.46069443891594597. Cophenetic correlation for Canberra distance and average linkage is 0.6751521558776901. Cophenetic correlation for Canberra distance and single linkage is 0.7566178713003985. Cophenetic correlation for Canberra distance and weighted linkage is 0.5351232761249246. Cophenetic correlation for Chebyshev distance and complete linkage is 0.8533474836336782. Cophenetic correlation for Chebyshev distance and average linkage is 0.8974159511838106. Cophenetic correlation for Chebyshev distance and single linkage is 0.7382354769296767. Cophenetic correlation for Chebyshev distance and weighted linkage is 0.8913624010768603. Cophenetic correlation for Cityblock distance and complete linkage is 0.8731477899179829. Cophenetic correlation for Cityblock distance and average linkage is 0.896329431104133. Cophenetic correlation for Cityblock distance and single linkage is 0.7252379350252723. Cophenetic correlation for Cityblock distance and weighted linkage is 0.8825520731498188. Cophenetic correlation for Correlation distance and complete linkage is 0.48880336464362406. Cophenetic correlation for Correlation distance and average linkage is 0.6601199957880307. Cophenetic correlation for Correlation distance and single linkage is 0.6102702201143325. Cophenetic correlation for Correlation distance and weighted linkage is 0.5317441799993322. Cophenetic correlation for Cosine distance and complete linkage is 0.5316900096793689. Cophenetic correlation for Cosine distance and average linkage is 0.6720803242310263. 
Cophenetic correlation for Cosine distance and single linkage is 0.7473448481710088. Cophenetic correlation for Cosine distance and weighted linkage is 0.5660201015465876. Cophenetic correlation for Euclidean distance and complete linkage is 0.8599730607972423. Cophenetic correlation for Euclidean distance and average linkage is 0.8977080867389372. Cophenetic correlation for Euclidean distance and single linkage is 0.7391220243806552. Cophenetic correlation for Euclidean distance and weighted linkage is 0.8861746814895477. Cophenetic correlation for Hamming distance and complete linkage is 0.20179683067025023. Cophenetic correlation for Hamming distance and average linkage is 0.5473662896769675. Cophenetic correlation for Hamming distance and single linkage is 0.7080851663862526. Cophenetic correlation for Hamming distance and weighted linkage is 0.6057204454928228. Cophenetic correlation for Jaccard distance and complete linkage is 0.20179683067025023. Cophenetic correlation for Jaccard distance and average linkage is 0.5473662896769675. Cophenetic correlation for Jaccard distance and single linkage is 0.7080851663862526. Cophenetic correlation for Jaccard distance and weighted linkage is 0.6057204454928228. Cophenetic correlation for Mahalanobis distance and complete linkage is 0.5422791209801747. Cophenetic correlation for Mahalanobis distance and average linkage is 0.8326994115042134. Cophenetic correlation for Mahalanobis distance and single linkage is 0.7058064784553606. Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.7805990615142516. Cophenetic correlation for Matching distance and complete linkage is 0.20179683067025023. Cophenetic correlation for Matching distance and average linkage is 0.5473662896769675. Cophenetic correlation for Matching distance and single linkage is 0.7080851663862526. Cophenetic correlation for Matching distance and weighted linkage is 0.6057204454928228. 
Cophenetic correlation for Minkowski distance and complete linkage is 0.8599730607972423. Cophenetic correlation for Minkowski distance and average linkage is 0.8977080867389372. Cophenetic correlation for Minkowski distance and single linkage is 0.7391220243806552. Cophenetic correlation for Minkowski distance and weighted linkage is 0.8861746814895477. Cophenetic correlation for Seuclidean distance and complete linkage is 0.8599730607972426. Cophenetic correlation for Seuclidean distance and average linkage is 0.8977080867389373. Cophenetic correlation for Seuclidean distance and single linkage is 0.7391220243806551. Cophenetic correlation for Seuclidean distance and weighted linkage is 0.8861746814895477. Cophenetic correlation for Sqeuclidean distance and complete linkage is 0.8820964814996479. Cophenetic correlation for Sqeuclidean distance and average linkage is 0.8783309583061251. Cophenetic correlation for Sqeuclidean distance and single linkage is 0.7284473846766318. Cophenetic correlation for Sqeuclidean distance and weighted linkage is 0.893679734763713.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.8977080867389373, which is obtained with average linkage.
We see that the cophenetic correlation is maximum with Euclidean distance and average linkage.
We will use a dendrogram to figure out the best way to allocate the data points to clusters.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_scaled_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(data_scaled_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
- The cophenetic correlation is highest for the average linkage method.
- 3 appears to be the appropriate number of clusters from the dendrogram for average linkage.
Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster containing all the data.
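As a minimal sketch of this bottom-up merging on toy data (the values here are hypothetical, using SciPy's `linkage` and `fcluster`):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 1-D data: two visually obvious groups
X = np.array([[1.0], [1.1], [1.2], [10.0], [10.1]])

# bottom-up: start from 5 singleton clusters and repeatedly merge the closest pair
Z = linkage(X, method="average", metric="euclidean")

# each row of Z records one merge: (cluster_i, cluster_j, merge_distance, new_cluster_size)
print(Z)

# cutting the tree at 2 clusters recovers the two groups
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```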
HCmodel = AgglomerativeClustering(n_clusters=3, affinity="euclidean", linkage="average")
HCmodel.fit(data_scaled_df)
AgglomerativeClustering(linkage='average', n_clusters=3)
data_scaled_df['hc_cluster'] = HCmodel.labels_
data['hc_cluster'] = HCmodel.labels_
cluster_profile_hc = data.iloc[:,1:].groupby('hc_cluster').mean()
cluster_profile_hc['count_of_customers'] = data.groupby('hc_cluster')['customer_key'].count()
cluster_profile_hc.style.highlight_max(color='lightgreen').highlight_min(color='pink')
| avg_credit_limit | total_credit_cards | total_visits_bank | total_visits_online | total_calls_made | cluster | count_of_customers | |
|---|---|---|---|---|---|---|---|
| hc_cluster | |||||||
| 0 | 33713.178295 | 5.511628 | 3.485788 | 0.984496 | 2.005168 | 0.002584 | 387 |
| 1 | 141040.000000 | 8.740000 | 0.600000 | 10.900000 | 1.080000 | 2.000000 | 50 |
| 2 | 12197.309417 | 2.403587 | 0.928251 | 3.560538 | 6.883408 | 1.000000 | 223 |
data_scaled_df.boxplot(by='hc_cluster', layout=(3,2), figsize=(10,14));
- If we look at the data, we see that there is a group which prefers online interactions with the bank; these customers have a much higher credit limit and also hold more credit cards (cluster 1).
- The customers who prefer in-person interactions tend to have a mid-range credit limit and number of credit cards (cluster 0).
- The customers who contact the bank via phone call form another segment, with the lowest credit limit and number of cards (cluster 2).
data[data['customer_key'].isin(data[data['customer_key'].duplicated()]['customer_key'].tolist())].sort_values('customer_key')
| customer_key | avg_credit_limit | total_credit_cards | total_visits_bank | total_visits_online | total_calls_made | cc_spending_bin | cluster | hc_cluster | |
|---|---|---|---|---|---|---|---|---|---|
| Sl_No | |||||||||
| 49 | 37252 | 6000 | 4 | 0 | 2 | 8 | Very Low | 1 | 2 |
| 433 | 37252 | 59000 | 6 | 2 | 1 | 2 | High | 0 | 0 |
| 5 | 47437 | 100000 | 6 | 0 | 12 | 3 | High | 2 | 1 |
| 333 | 47437 | 17000 | 7 | 3 | 1 | 0 | Low | 0 | 0 |
| 412 | 50706 | 44000 | 4 | 5 | 0 | 2 | Mid | 0 | 0 |
| 542 | 50706 | 60000 | 7 | 5 | 2 | 2 | High | 0 | 0 |
| 392 | 96929 | 13000 | 4 | 5 | 0 | 0 | Low | 0 | 0 |
| 399 | 96929 | 67000 | 6 | 2 | 2 | 2 | High | 0 | 0 |
| 105 | 97935 | 17000 | 2 | 1 | 2 | 10 | Low | 1 | 2 |
| 633 | 97935 | 187000 | 7 | 1 | 7 | 0 | High | 2 | 1 |
If we consider the duplicate records to be updated records for the same customers, then 3 of the 5 customers have actually changed their clusters/groups between records. It appears that by increasing credit limits, or by converting customers to digital banking, we can move customers to a more desirable and profitable cluster.
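If we wanted to act on that reading, one option is to keep only the latest record per `customer_key` before clustering. A sketch on a hypothetical subset of the duplicated rows above; the assumption that a larger `Sl_No` means a more recent record is mine, not stated in the data dictionary.

```python
import pandas as pd

# hypothetical subset of the duplicated records,
# assuming a larger Sl_No means a more recent record
df = pd.DataFrame({
    "Sl_No": [49, 433, 5, 333],
    "customer_key": [37252, 37252, 47437, 47437],
    "avg_credit_limit": [6000, 59000, 100000, 17000],
})

# keep only the most recent record per customer
latest = (
    df.sort_values("Sl_No")
      .drop_duplicates("customer_key", keep="last")
)
print(latest)
```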
Although there are only 5 dimensions, it'll be really cool to visualize the clusters in 3-dimensional space without losing much of the information. Let's use PCA to reduce the dimensions so that 90% of the variance in the data is explained.
# PCA and seaborn are used in the cells below
from sklearn.decomposition import PCA
import seaborn as sns
data_scaled_df_pca = data_scaled_df.drop('hc_cluster', axis=1).copy()
pca = PCA()
pca.fit(data_scaled_df_pca)
PCA()
Let's check the variance explained by individual components.
pca.explained_variance_ratio_
array([0.45736578, 0.37426483, 0.06401348, 0.0555688 , 0.04878711])
# visualizing the variance explained by individual principal components
sns.set(style='darkgrid')
plt.figure(figsize=(10, 10))
plt.plot(
    range(1, 6), pca.explained_variance_ratio_.cumsum(), marker="o", linestyle="--"
)
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
Text(0, 0.5, 'Cumulative Explained Variance')
For 90% variance explained, the number of components looks to be 3.
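As a cross-check, scikit-learn can also pick the component count directly from a variance threshold when `n_components` is a float between 0 and 1. A sketch on synthetic data (the matrix `X` here is hypothetical, not the bank data):

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic 5-feature data with strongly unequal column variances
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ np.diag([5.0, 3.0, 1.0, 0.5, 0.2])

# a float n_components asks scikit-learn to keep just enough
# components to explain that fraction of the variance
pca = PCA(n_components=0.9, svd_solver="full")
pca.fit(X)

print(pca.n_components_)                          # chosen number of components
print(pca.explained_variance_ratio_.sum())        # >= 0.9 by construction
```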
pca = PCA(
    n_components=3, svd_solver="full"
)  # svd_solver="full" computes the exact SVD; for very large datasets, svd_solver="randomized" is usually faster
pca.fit(data_scaled_df_pca)
PCA(n_components=3, svd_solver='full')
# checking the variance explained by individual components.
print('Explained variance = {var} %'.format(var=round(pca.explained_variance_ratio_.sum(),2)*100))
Explained variance = 90.0 %
plt.figure(figsize=(10, 10))
plt.plot(
    range(1, 4), pca.explained_variance_ratio_.cumsum(), marker="o", linestyle="--"
)
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
Text(0, 0.5, 'Cumulative Explained Variance')
subset_pca = pca.transform(data_scaled_df_pca)
subset_pca_df = pd.DataFrame(subset_pca)
sns.pairplot(subset_pca_df, diag_kind='kde');
# SciPy utilities for hierarchical clustering and cophenetic correlation
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet
from scipy.spatial.distance import pdist
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_pca_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(subset_pca_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Observations
- As with the original scaled data, 3 appears to be the appropriate number of clusters from the dendrograms on the PCA-transformed data; complete linkage is used for the model below.
hc = AgglomerativeClustering(n_clusters=3, affinity="euclidean", linkage="complete")
hc_labels = hc.fit_predict(subset_pca)
# adding hierarchical cluster labels to the original dataframe
data_pca = data.drop(['cluster', 'hc_cluster'], axis=1)
data_pca['pca_hc_cluster'] = hc_labels
data_scaled_pca_df = data_scaled_df.drop('hc_cluster', axis=1)
data_scaled_pca_df['pca_hc_cluster'] = hc_labels
cluster_profile2 = data_pca.groupby('pca_hc_cluster').mean()
cluster_profile2['customer_count'] = (
    data_pca.groupby('pca_hc_cluster')['customer_key'].count().values
)
# let's display cluster profile
cluster_profile2.style.highlight_max(color="lightgreen", axis=0)
| customer_key | avg_credit_limit | total_credit_cards | total_visits_bank | total_visits_online | total_calls_made | customer_count | |
|---|---|---|---|---|---|---|---|
| pca_hc_cluster | |||||||
| 0 | 54951.709512 | 33591.259640 | 5.496144 | 3.467866 | 0.987147 | 2.002571 | 389 |
| 1 | 56708.760000 | 141040.000000 | 8.740000 | 0.600000 | 10.900000 | 1.080000 | 50 |
| 2 | 55120.814480 | 12217.194570 | 2.402715 | 0.936652 | 3.579186 | 6.932127 | 221 |
pca_df = subset_pca_df.copy()
pca_df['pca_hc_cluster'] = hc_labels
pca_df = pca_df.rename(columns={0:'component_1', 1:'component_2', 2:'component_3'})
sns.pairplot(pca_df, diag_kind='kde', hue='pca_hc_cluster');
import plotly as py
import plotly.graph_objs as go
trace1 = go.Scatter3d(
    x=pca_df['component_1'],
    y=pca_df['component_2'],
    z=pca_df['component_3'],
    mode='markers',
    marker=dict(
        color=pca_df['pca_hc_cluster'],
        size=20,
        line=dict(
            color=pca_df['pca_hc_cluster'],
            width=12
        ),
        opacity=0.8
    ),
)
d = [trace1]
layout = go.Layout(
    title='Clusters',
    scene=dict(
        xaxis=dict(title='Component 1'),
        yaxis=dict(title='Component 2'),
        zaxis=dict(title='Component 3')
    )
)
fig = go.Figure(data=d, layout=layout)
py.offline.iplot(fig)
data_scaled_pca_df.boxplot(by='pca_hc_cluster', layout=(3,2), figsize=(10,14));
There are three significant types of customer: